其他
GATK之FastaStats和CountReads
GATK的本职工作是Variant calling,但是就像我之前所说的,它作为基因组分析工具箱,还是有很多其他工具,今天学习的是诊断和质量控制工具的其中两个:CountReads,FastaStats。
FastaStats
功能: 计算参考基因组的基本统计值
分类: 诊断和质量控制工具
概要: 主要就是统计碱基总数,和常规的碱基数(ATCG)
吐槽:功能还真是简单。。
输入: FASTA参考文件
输出: 结果要么输出到屏幕,要么是输出到(-o)到文件中。
使用案例:我看了一下拟南芥的参考基因组
java -jar ~/biosoft/GenomeAnalysisTK.jar -T FastaStats \
-R TAIR10.fa
输出:
Total bases 119667750
Regular bases 119481543
CountReads
功能: 计算reads数
分类: 诊断和质量控制工具
概要: 最好和--read-filter
合用,这样可以了解下符合特定标准的reads数
输入: 一个或多个BAM文件
输出: 结果会输出到屏幕(标准输出)上,毕竟是用来确定阈值的,也不需要一定要输出到文件中
案例:不加参数--read-filter
java -jar ~/biosoft/GenomeAnalysisTK.jar -T CountReads\
-R $work/database/TAIR10/TAIR10.fa\
-I BC_bg_reads.sorted.bam
输出:
CountReads - CountReads counted 55080781 reads in the traversal
--read-filter/-rf
后面可以接很多的选项,官方文档列出了如下内容:
Name | Summary |
---|---|
Filter out reads with wonky CIGAR strings | |
Filter out reads whose mate maps to a different contig | |
Filter out duplicate reads | |
Filter out reads that fail the vendor quality check | |
Filter out reads with low mapping qualities for HaplotypeCaller | |
Only use reads from the specified library | |
Filter out malformed reads | |
Filter out reads with low mapping qualities | |
Filter out reads with no mapping quality information | |
Filter out reads with mapping quality zero | |
Filter out reads with bad pairing (and related) properties | |
Filter out reads that exceed a given insert size | |
Filter out reads without read group information | |
Filter out reads that do not have an original quality quality score (OQ) tag | |
Filter out read records that are secondary alignments | |
Filter out reads that are over-soft-clipped | |
Filter out reads produced by 454 technology | |
Filter out reads that were generated by a specific sequencing platform | |
Filter out reads with blacklisted platform unit tags | |
Filter out reads matching a read group tag value | |
Filter out reads based on length | |
Only use reads with this read name | |
Filter out reads based on strand orientation | |
Set the mapping quality of all reads to a given value | |
Set the mapping quality of reads with a given value to another given value | |
Revert the MQ of reads that were modified by IndelRealigner | |
Only use reads belonging to a specific sample | |
Only use reads from the specified read group | |
Filter out unmapped reads |
使用的时候记住,对于XXXXFilter,你需要点每个过滤选项进去看下具体用法,比如说我需要过滤比对质量低于20的,我发现MappingQualityFilter的用法解释是
java -jar GenomeAnalysisTk.jar \
-T HaplotypeCaller \
-R reference.fasta \
-I input.bam \
-o output.vcf \
-rf MappingQuality \
-mmq 15
那么我前面的代码就可以改为
java -jar ~/biosoft/GenomeAnalysisTK.jar -T CountReads \
-R $work/database/TAIR10/TAIR10.fa -I BC_bg_reads.sorted.bam \
-rf MappingQuality -mmq 15
结果:
8190205 reads were filtered out during the traversal out of
approximately 55080781 total reads (14.87%)
果然加上比对质量后,就有一些要被过滤掉呢。
基本上GATK每一个工具都有一些默认过滤选项,你可以在每个工具的额外信息部分进行了解。
今天就介绍这两个工具,顺便提了GATK的过滤功能。白了个白,每天学个GATK的任务完成。